read data from an hdf rather than a csv #29

Merged
rmudambi merged 5 commits into develop from feature/read-hdf on Apr 5, 2023

@rmudambi rmudambi commented Apr 4, 2023

Read data from HDF rather than CSV

Description

  • Category: feature
  • JIRA issue: MIC-3942

Read data in from HDF rather than CSV.
Fix errors in incorrect_select_options.csv.
Fix an issue with categorical dtypes by using NA instead of "" for missing data.
Cast the column to str dtype when generating typographic errors.
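
As an illustration of the categorical dtype point above, here is a minimal sketch of why "" and NA behave differently in a categorical column. The helper name `blank_to_na` is hypothetical and is not taken from this PR; it only shows one way empty strings could be swapped for NA so that they stop counting as a category of their own.

```python
import pandas as pd


def blank_to_na(column: pd.Series) -> pd.Series:
    """Hypothetical helper: treat "" as missing in a categorical column."""
    # Work on plain Python objects so the replacement is not constrained
    # by the existing set of categories.
    as_object = column.astype(object)
    # Keep every value that is not the empty string; use NA elsewhere.
    as_object = as_object.where(as_object != "", pd.NA)
    # Rebuild the categorical; missing values do not become a category.
    return as_object.astype("category")


raw = pd.Series(["a", "", "b"], dtype="category")
cleaned = blank_to_na(raw)
print(list(raw.cat.categories))      # "" is a real category here
print(list(cleaned.cat.categories))  # only the non-empty values remain
print(cleaned.isna().tolist())
```

With "" stored as a category, `isna()` never flags those rows, which is presumably the kind of mismatch the NA change is meant to avoid.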

Testing

Ran integration tests against sample data generated by the updated simulation, which outputs HDF files.
Ran the automated test suite.

column.loc[to_noise_idx], configuration, randomness_stream, additional_key
)

column.loc[to_noise_idx] = noised_data
Contributor

Was this actually causing a problem or do you just find this more readable?

Collaborator Author

This just made debugging easier since I could put a breakpoint between the function call and the assignment to the series.

-data = pd.read_csv(path, dtype=str, keep_default_na=False)
+data = pd.read_hdf(path)
+if not isinstance(data, pd.DataFrame):
+    raise TypeError(f"File located at {path} must contain a pandas DataFrame.")
Contributor

@stevebachmeier stevebachmeier Apr 4, 2023

Consider moving into a load_data utility function so that we don't forget to check type.
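
A minimal sketch of what such a utility might look like, assuming a function named `load_data` as in the comment above. The `reader` parameter is a hypothetical addition, not something proposed in this PR; it only makes the type check easy to exercise without an HDF file on disk.

```python
from pathlib import Path
from typing import Any, Callable, Union

import pandas as pd


def load_data(
    path: Union[str, Path],
    reader: Callable[[Union[str, Path]], Any] = pd.read_hdf,
) -> pd.DataFrame:
    """Read a dataset and guarantee that the result is a DataFrame."""
    data = reader(path)
    # pd.read_hdf can return other pandas objects (for example a Series),
    # so the check lives here rather than at every call site.
    if not isinstance(data, pd.DataFrame):
        raise TypeError(f"File located at {path} must contain a pandas DataFrame.")
    return data
```

Callers would then use `load_data(path)` wherever `pd.read_hdf(path)` is called directly, so the type check cannot be forgotten.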

@@ -303,6 +303,7 @@ def keyboard_corrupt(truth, corrupted_pr, addl_pr, rng):
 include_original_token_level = configuration.include_original_token_level

 rng = np.random.default_rng(seed=randomness_stream.seed)
+column = column.astype(str)
Contributor

This will convert any NaNs to "nan" and proceed to corrupt that. We shouldn't have any NaNs at this point though, right? B/c those get dropped up front when this gets called?

Collaborator Author

Yes, that's correct, but definitely great to call this out
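
For reference, a defensive variant of the cast could look like the sketch below. It is not what this PR does, since the thread confirms that missing values are dropped before this function is called; the helper name `cast_non_missing_to_str` is hypothetical. It casts only the non-missing entries, so a NaN can never be turned into the literal string "nan" and then corrupted.

```python
import numpy as np
import pandas as pd


def cast_non_missing_to_str(column: pd.Series) -> pd.Series:
    """Hypothetical helper: cast values to str but leave missing values alone."""
    # Object dtype can hold strings and missing values side by side.
    result = column.astype(object)
    present = result.notna()
    # Only the non-missing entries are converted.
    result.loc[present] = result.loc[present].astype(str)
    return result


column = pd.Series([12, np.nan, 7.5])
cast = cast_non_missing_to_str(column)
print(cast.isna().tolist())
print([type(value).__name__ for value in cast[cast.notna()]])
```

The trade-off is an object-dtype result rather than a uniform string column, which is why relying on the upstream drop, as the PR does, is the simpler choice.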

@rmudambi rmudambi merged commit 0c4290a into develop Apr 5, 2023
@rmudambi rmudambi deleted the feature/read-hdf branch April 5, 2023 00:39